A CENTRAL LIMIT THEOREM FOR 
TEMPORALLY NON-HOMOGENOUS MARKOV CHAINS 
WITH APPLICATIONS TO DYNAMIC PROGRAMMING 

ALESSANDRO ARLOTTO AND J. MICHAEL STEELE 


Abstract. We prove a central limit theorem for a class of additive processes
that arise naturally in the theory of finite horizon Markov decision problems.
The main theorem generalizes a classic result of Dobrushin (1956) for
temporally non-homogeneous Markov chains, and the principal innovation is that
here the summands are permitted to depend on both the current state and
a bounded number of future states of the chain. We show through several
examples that this added flexibility gives one a direct path to asymptotic
normality of the optimal total reward of finite horizon Markov decision problems.
The same examples also explain why such results are not easily obtained by
alternative Markovian techniques such as enlargement of the state space.

1. Stochastic Dynamic Programs and Asymptotic Distributions 


In a finite horizon stochastic dynamic program (or Markov decision problem)
with $n$ periods, it is typical that the decision policy $\pi^*$ that maximizes total
expected reward will take actions that depend on both the current state of the system
and on the number of periods that remain within the horizon. The total reward
$R_n(\pi^*)$ that is obtained when one follows the mean-optimal policy $\pi^*$ will have
the expected value that optimality requires, but the actual reward $R_n(\pi^*)$ that is
realized may (or may not) behave in a way that is well summarized by its
expected value alone.

As a consequence, a well-founded judgement about the economic value of the
policy $\pi^*$ will typically require a deeper understanding of the random variable
$R_n(\pi^*)$. One gets meaningful benefit from the knowledge of the variance of $R_n(\pi^*)$
or its higher moments (Arlotto et al., 2014), but, in the most favorable instance,
one would hope to know the distribution of $R_n(\pi^*)$, or at least an asymptotic
approximation to that distribution.

Limit theorems for the total reward (or the total cost) of a Markov decision
problem (or MDP) have been studied extensively, but earlier work has focused
almost exclusively on those problems where the optimal decision policy is stationary.
The first steps were taken by Mandl (1973, 1974) in the context of finite state
space MDPs. This work was subsequently refined and extended to more general
MDPs by Mandl (1985), Mandl and Lausmanova (1991), Mendoza-Perez (2008),
and Mendoza-Perez and Hernandez-Lerma (2010). Through these investigations
one now has a substantial limit theory for a rich class of MDPs that includes
infinite-horizon MDPs with discounting and infinite-horizon MDPs where one seeks
to maximize the long-run average reward.

Distributional properties of MDPs have also been considered in the design of
pathwise asymptotic optimal controls. For instance, Leizarowitz (1987, 1988),
Rotar (1985, 1986), Asriev and Rotar (1990), Rotar (1991), and Belkina and Rotar
(2005) studied controls that produce a long-run average reward that is
asymptotically optimal almost surely. Also, Leizarowitz (1996) investigates pathwise
optimality in infinite horizon problems. Rotar (2012) provides a sustained review of
this literature including a more comprehensive list of references.

Here the focus is on finite horizon MDPs and, to deal with such problems, one 
needs to break from the framework of stationary decision policies. Moreover, for 
the purpose of the intended applications, it is useful to consider additive functionals 
that are more complex than those that have been considered earlier in the theory 
of temporally non-homogeneous Markov chains. These functionals are defined in 
the next subsection where we also give the statement of our main theorem. 


A Class of MDP Linked Processes 

In the theory of discrete-time finite horizon MDPs, one commonly studies a 
sequence of problems with increasing sizes. Here, it will be convenient to consider 
two parameters, m and n. The parameter m is fixed, and it will be determined by 
the nature of the actions and rewards of the MDP. The parameter n measures the 
size of the MDP; it is essentially the traditional horizon size, but it comes with a 
small twist. 

Now, for a given $m$ and $n$, we consider an arbitrary sequence of random variables
$\{X_{n,i} : 1 \le i \le n + m\}$ with values in a Borel space $\mathcal{X}$, and we also consider an
array of $n$ real valued functions of $1 + m$ variables,

\[ f_{n,i} : \mathcal{X}^{1+m} \to \mathbb{R}, \qquad 1 \le i \le n. \]

Further properties will soon be required for both the random variables and the 
array of functions, but, for the moment, we only note that the random variable of 
most importance to us here is the sum 

\[ S_n = \sum_{i=1}^{n} Z_{n,i} \quad \text{where} \quad Z_{n,i} = f_{n,i}(X_{n,i}, \ldots, X_{n,i+m}). \tag{1} \]

In a typical MDP application, the random variable $Z_{n,i}$ has an interpretation as
a reward for an action taken in period $i \in \{1, 2, \ldots, n\}$. The size parameter $n$ is
then the number of periods in which decisions are made, and $S_n$ is the total reward
received over all periods $i \in \{1, 2, \ldots, n\}$ when one follows the policy $\pi_n$. Here, of
course, the actions chosen by $\pi_n$ are allowed to depend on both the current time
and the current state.

The parameter $m$ is new to this formulation, and, as we will shortly explain,
the flexibility provided by $m$ is precisely what makes sums of the random variables
$Z_{n,i} = f_{n,i}(X_{n,i}, \ldots, X_{n,i+m})$ useful in the theory of MDPs. In the typical finite
horizon setting, the index $i$ corresponds to the decision period, and the realized
reward that is associated with period $i$ may depend on many things. In particular, it
commonly depends on $n$, $i$, the decision period state $X_{n,i}$, and one or more values of
the post-decision period realizations of the driving sequence $\{X_{n,i} : 1 \le i \le n + m\}$.


Requirements on the Driving Sequence 

We always require the driving sequence $\{X_{n,i} : 1 \le i \le n + m\}$ to be a Markov
process, but here the Markov kernel for the transition between time $i$ and $i + 1$
is allowed to change as $i$ changes. More precisely, we take $\mathcal{B}(\mathcal{X})$ to be the set of
Borel subsets of the Borel space $\mathcal{X}$, and we define $\{X_{n,i} : 1 \le i \le n + m\}$ to be
the temporally non-homogeneous Markov chain that is determined by specifying a
distribution for the initial value $X_{n,1}$ and by making the transition from time $i$ to
time $i + 1$ in accordance with the Markov transition kernel

\[ K^{(n)}_{i,i+1}(x, B) = \mathbb{P}(X_{n,i+1} \in B \mid X_{n,i} = x), \quad \text{where } x \in \mathcal{X} \text{ and } B \in \mathcal{B}(\mathcal{X}). \]


The transition kernels can be quite general, but we do require a condition on
their minimal ergodic coefficient. Here we first recall that for any Markov transition
kernel $K = K(x, dy)$ on $\mathcal{X}$, the Dobrushin contraction coefficient is defined by

\[ \delta(K) = \sup_{x_1, x_2 \in \mathcal{X},\; B \in \mathcal{B}(\mathcal{X})} | K(x_1, B) - K(x_2, B) |, \tag{2} \]

and the corresponding ergodic coefficient is given by

\[ \alpha(K) = 1 - \delta(K). \]

Further, for an array $\{K^{(n)}_{i,i+1} : 1 \le i < n\}$ of Markov transition kernels on $\mathcal{X}$, the
minimal ergodic coefficient of the $n$'th row is defined by setting

\[ \alpha_n = \min_{1 \le i < n} \alpha(K^{(n)}_{i,i+1}). \tag{3} \]
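
As a purely illustrative aside, the definitions (2) and (3) can be sketched numerically when the state space is finite, so that a kernel is a row-stochastic matrix and the supremum over Borel sets reduces to a total variation distance between rows. The function names and the three sample kernels below are hypothetical choices made for this sketch; they are not taken from the paper.

```python
from itertools import combinations


def dobrushin_delta(K):
    """Contraction coefficient delta(K) of a row-stochastic matrix K.

    On a finite state space, sup_B |K(x1, B) - K(x2, B)| equals the total
    variation distance between rows x1 and x2, i.e. half their L1 distance.
    """
    pairs = combinations(range(len(K)), 2)
    return max(
        (0.5 * sum(abs(a - b) for a, b in zip(K[x1], K[x2])) for x1, x2 in pairs),
        default=0.0,
    )


def ergodic_alpha(K):
    """Ergodic coefficient alpha(K) = 1 - delta(K)."""
    return 1.0 - dobrushin_delta(K)


def minimal_ergodic_coefficient(kernels):
    """alpha_n: smallest ergodic coefficient among the one-step kernels of a row."""
    return min(ergodic_alpha(K) for K in kernels)


# Hypothetical two-state kernels, chosen only for illustration.
K_mixing = [[0.9, 0.1], [0.2, 0.8]]       # rows differ: delta is about 0.7
K_independent = [[0.5, 0.5], [0.5, 0.5]]  # identical rows: delta = 0, alpha = 1
K_identity = [[1.0, 0.0], [0.0, 1.0]]     # degenerate: delta = 1, alpha = 0

print(minimal_ergodic_coefficient([K_mixing, K_independent]))
print(minimal_ergodic_coefficient([K_mixing, K_identity]))
```

The second row of kernels contains the identity kernel, so its minimal ergodic coefficient is zero; this is the kind of degeneracy under which a condition stated in terms of $\alpha_n^{-1}$ or $\alpha_n^{-2}$ cannot hold.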

There is also a minor technical point worth noting here. Although we study
additive functionals that can depend on the full row $\{X_{n,i} : 1 \le i \le n + m\}$ with
$n + m$ elements, the last $1 + m$ elements of the row are used in a way that does not
require any constraint on the associated ergodic coefficients. Specifically, the last
$1 + m$ elements of the row are used only to determine the value of the time $n$ reward
that one receives as a consequence of the last decision. It is for this reason that in
expressions like (3) we need only to consider $i$ in the range from $1$ to $n - 1$.


Main Result: A CLT for Temporally Non-Homogeneous Markov Chains 

When the sums $\{S_n : n \ge 1\}$ defined by (1) are centered and scaled, it is natural
to expect that, in favorable circumstances, they will converge in distribution to
the standard Gaussian. The next theorem confirms that this is the case provided
that one has some modest compatibility between the size of the minimal ergodic
coefficient $\alpha_n$, the size of the functions $f_{n,i}$, $1 \le i \le n$, and the variance of $S_n$.

Theorem 1 (CLT for Temporally Non-Homogeneous Markov Chains). If there are
constants $C_1, C_2, \ldots$ such that

\[ \max_{1 \le i \le n} \| f_{n,i} \|_\infty \le C_n \quad \text{and} \quad C_n^2 \alpha_n^{-2} = o(\mathrm{Var}[S_n]), \tag{4} \]

then one has the convergence in distribution

\[ \frac{S_n - \mathbb{E}[S_n]}{\sqrt{\mathrm{Var}[S_n]}} \Rightarrow N(0, 1), \quad \text{as } n \to \infty. \tag{5} \]

Corollary 2. If there are constants $c > 0$ and $C < \infty$ such that
$\alpha_n \ge c$ and $C_n \le C$ for all $n \ge 1$,
then one has the asymptotic normality (5) whenever $\mathrm{Var}[S_n] \to \infty$ as $n \to \infty$.

Remark 3 (Boundedness Assumption). One might hope to relax the condition
in Theorem 1 that for each fixed $n \ge 1$ the functions $\{f_{n,i} : 1 \le i \le n\}$ are
uniformly bounded. Even though the oscillation bounds in Section 5 make heavy
use of the supremum norm, one could conceivably use truncation arguments that
still give access to effective oscillation bounds. Unfortunately, truncations would
substantially complicate an argument that is already long, so we have stayed with
uniform boundedness. In some simpler contexts, it is known that the uniform
boundedness condition can be relaxed; specifically, there are such relaxations in
the Markov additive CLTs of Nagaev (1957, 1961), Jones (2004), and Statuliavicus
(1969).


Organization of the Analysis 

Before proving this theorem, it is useful to note how it compares with the classic
CLT of Dobrushin (1956) for non-homogeneous Markov chains. If we set $m = 0$
in Theorem 1 then we recover the Dobrushin theorem, so the main issue is to
understand how one benefits from the possibility of taking $m \ge 1$. This is addressed
in detail in Section 2 and in the examples of Sections 8 and 9.

After recalling some basic facts about the minimal ergodic coefficient in Section
3, the proof begins in earnest in Section 4 where we note that there is a martingale
that one can expect to be a good approximation for $S_n$. The confirmation of the
approximation is carried out in Sections 5 and 6. In Section 7 we complete the proof
by showing that the assumptions of our theorem also imply that the approximating
martingale satisfies the conditions of a basic martingale central limit theorem.

We then take up applications and examples. In particular, we show in Section 8
that Theorem 1 leads to an asymptotic normal law for the optimal total cost of a
classic dynamic inventory management problem, and in Section 9 we see how the
theorem can be applied to a well-studied problem in combinatorial optimization.


2. On m = 0 vs m > 0 and Dobrushin’s CLT 


Dobrushin (1956) introduced many of the concepts that are central to the theory
of additive functionals of a non-homogenous Markov chain. In addition to
introducing the contraction coefficient (2), Dobrushin also provided one of the
earliest, yet most refined, of the CLTs for non-homogenous chains.

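Theorem 4 (Dobrushin, 1956). If there are constants $C_1, C_2, \ldots$ such that

\[ \max_{1 \le i \le n} \| f_{n,i} \|_\infty \le C_n \quad \text{and} \quad C_n^2 \alpha_n^{-3} = o\Big( \sum_{i=1}^{n} \mathrm{Var}[f_{n,i}(X_{n,i})] \Big), \tag{6} \]

then for $S_n = \sum_{i=1}^{n} f_{n,i}(X_{n,i})$ one has the asymptotic Gaussian law

\[ \frac{S_n - \mathbb{E}[S_n]}{\sqrt{\mathrm{Var}[S_n]}} \Rightarrow N(0, 1), \quad \text{as } n \to \infty. \]
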
After Dobrushin's work there were refinements and extensions by Sarymsakov
(1961), Hanen (1963), and Statuliavicus (1969), but the work that is closest to
the approach taken here is that of Sethuraman and Varadhan (2005). They used a
martingale approximation to give a streamlined proof of Dobrushin's theorem, and
they also used spectral theory to prove the variance lower bound

\[ \frac{\alpha_n}{4} \Big( \sum_{i=1}^{n} \mathrm{Var}[f_{n,i}(X_{n,i})] \Big) \le \mathrm{Var}[S_n]. \tag{7} \]



This improves a lower bound of Iosifescu and Theodorescu (1969, Theorem 1.2.7)
by a factor of two, and Peligrad (2012, Corollary 15) gives some further refinements.

There are also upper bounds for the variance of $S_n$ in terms of the sum of the
individual variances and the reciprocal $\alpha_n^{-1}$ of the minimal ergodic coefficient. The
most recent of these are given by Szewczak (2012) where they are used in the
analysis of continued fraction expansions among other things.


Comparison of Conditions 

Theorem 1 requires that $C_n^2 \alpha_n^{-2} = o(\mathrm{Var}[S_n])$ as $n \to \infty$, a condition that is
directly imposed on the variance of the total sum $S_n$. On the other hand, Dobrushin's
theorem imposes the condition (6) on the sum of the variances of the individual
summands. This difference is not accidental; it actually underscores a notable
distinction between the traditional setting where $m = 0$ and the present situation
where $m \ge 1$.

When one has $m = 0$, the variance lower bound (7) tells us that condition (6)
of Theorem 4 implies condition (4) of Theorem 1, but, when $m \ge 1$, there is not
any analog to the lower bound (7). This is the nuance that forces us to impose an
explicit condition on the variance of the sum $S_n$ in Theorem 1.

A simple example can be used to illustrate the point. We take $m = 1$ and for
each $n \ge 1$ we consider a sequence $X_{n,1}, X_{n,2}, \ldots, X_{n,n+1}$ of independent
identically distributed random variables with $0 < \mathrm{Var}[X_{n,1}] < \infty$. The minimal ergodic
coefficient in this case is just $\alpha_n = 1$. Next, for $1 \le i \le n$ we consider the function

\[ f_{n,i}(x, y) = \begin{cases} x & \text{if } i \text{ is even} \\ -y & \text{if } i \text{ is odd;} \end{cases} \]

we then set $S_0 = 0$, and, more generally, we let

\[ S_n = \sum_{i=1}^{n} f_{n,i}(X_{n,i}, X_{n,i+1}). \]

Now, for each $n \ge 0$ we see that cancellations in the sum give us $S_{2n} = 0$ and
$S_{2n+1} = -X_{2n+1, 2(n+1)}$, so, according to parity we find

\[ \mathrm{Var}[S_{2n}] = 0 \quad \text{and} \quad \mathrm{Var}[S_{2n+1}] = \mathrm{Var}[X_{n,1}]. \]

In particular, we have $\mathrm{Var}[S_n] = O(1)$ for all $n \ge 1$, while, on the other hand, for
the sum of the individual variances we have that

\[ \sum_{i=1}^{n} \mathrm{Var}[f_{n,i}(X_{n,i}, X_{n,i+1})] = n \, \mathrm{Var}[X_{n,1}] = \Omega(n). \]

The bottom line is that when $m \ge 1$, there is no analog of the lower bound (7),
and, as a consequence, a result like Theorem 1 needs to impose an explicit condition
on $\mathrm{Var}[S_n]$ rather than a condition on the sum of the variances of the individual
summands.
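
The cancellation in this example is easy to check by simulation. The following is a minimal sketch, assuming standard Gaussian inputs (the example only requires i.i.d. variables with finite positive variance); the helper names `total_sum`, `sample_variance`, and `simulate` are hypothetical and serve only this illustration.

```python
import random


def f(i, x, y):
    """Summand f_{n,i}(x, y) from the example: x for even i, -y for odd i."""
    return x if i % 2 == 0 else -y


def total_sum(xs):
    """S_n for a row xs = [X_{n,1}, ..., X_{n,n+1}], so that n = len(xs) - 1."""
    n = len(xs) - 1
    # xs[i - 1] holds X_{n,i} and xs[i] holds X_{n,i+1}.
    return sum(f(i, xs[i - 1], xs[i]) for i in range(1, n + 1))


def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible


def simulate(n, reps=2000):
    """Draw reps independent copies of S_n with standard Gaussian inputs (an assumption)."""
    return [total_sum([rng.gauss(0.0, 1.0) for _ in range(n + 1)]) for _ in range(reps)]


if __name__ == "__main__":
    print("empirical Var[S_10]:", sample_variance(simulate(10)))  # every pair cancels
    print("empirical Var[S_11]:", sample_variance(simulate(11)))  # one term survives
```

For even $n$ every pair of consecutive summands cancels exactly, so the empirical variance is zero, while for odd $n$ only the last summand survives; the sum of the $n$ individual variances, by contrast, grows linearly in $n$.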

Two Related Alternatives 

One might hope to prove Theorem 1 by considering an enlarged state space where
one could first apply Dobrushin's CLT (Theorem 4) and then extract Theorem 1
as a consequence. For example, given the conditions of Theorem 1 with $m = 1$, one
might introduce the bivariate chain $\{\widehat{X}_{n,i} = (X_{n,i}, X_{n,i+1}) : 1 \le i \le n\}$ with the
hope of extracting the conclusion of Theorem 1 by applying Dobrushin's theorem
to $\{\widehat{X}_{n,i} : 1 \le i \le n\}$.

The fly in the ointment is that the resulting bivariate chain can be degenerate in
the sense that the minimal ergodic coefficient of the chain $\{\widehat{X}_{n,i} : 1 \le i \le n\}$ can
equal zero. In such a situation, Dobrushin's theorem does not apply to the process
$\{\widehat{X}_{n,i} : 1 \le i \le n\}$, even though Theorem 1 may still provide a useful central limit
theorem. We give two concrete examples of this phenomenon in Sections 8 and 9.

A further way to try to rehabilitate the possibility of using the bivariate chain
$\{\widehat{X}_{n,i} : 1 \le i \le n\}$ is to appeal to theorems where the minimal ergodic coefficient $\alpha_n$
is replaced with some less fragile quantity. For example, Peligrad (2012) has proved
that one can replace $\alpha_n$ in Dobrushin's theorem with the maximal coefficient of
correlation $\rho_n$. Since one always has $\rho_n \le \sqrt{1 - \alpha_n}$, Peligrad's CLT is guaranteed
to apply at least as widely as Dobrushin's CLT. Nevertheless, the examples of
Sections 8 and 9 both show that this refinement still does not help.

3. On Contractions and Oscillations 

To prove Theorem 1, we need to assemble a few properties of the Dobrushin
contraction coefficient. Much more can be found in Seneta (2006, Section 4.3),
Winkler (2003, Section 4.2), or Del Moral (2004, Chapter 4).

If $\mu$ and $\nu$ are two probability measures, we write $\| \mu - \nu \|_{TV}$ for the total
variation distance between $\mu$ and $\nu$. Dobrushin's coefficient (2) can then be written
as

\[ \delta(K) = \sup_{x_1, x_2 \in \mathcal{X}} \| K(x_1, \cdot) - K(x_2, \cdot) \|_{TV}, \]

and one always has $0 \le \delta(K) \le 1$. For any two Markov kernels $K_1$ and $K_2$ on $\mathcal{X}$,
we also set

\[ (K_1 K_2)(x, B) = \int K_1(x, dz) K_2(z, B), \]

so $(K_1 K_2)(x, B)$ represents the probability that one ends up in $B$ given that one
starts at $x$ and takes two steps: the first governed by the transition kernel $K_1$
and the second governed by the kernel $K_2$. A crucial property of the Dobrushin
coefficient $\delta$ is that one has the product inequality

\[ \delta(K_1 K_2) \le \delta(K_1) \delta(K_2). \tag{8} \]

Now, given any array $\{K^{(n)}_{i,i+1} : 1 \le i < n\}$ of Markov kernels and any pair of
times $1 \le i < j \le n$, one can form the multi-step transition kernel

\[ K^{(n)}_{i,j}(x, B) = (K^{(n)}_{i,i+1} K^{(n)}_{i+1,i+2} \cdots K^{(n)}_{j-1,j})(x, B), \]

and, as the notation suggests, the kernel $K^{(n)}_{i,j}$ can change as $i$ changes. The
product inequality (8) and the definition of the minimal ergodic coefficient (3) then
tell us

\[ \delta(K^{(n)}_{i,j}) \le (1 - \alpha_n)^{j-i} \quad \text{for all } 1 \le i < j \le n. \tag{9} \]
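
To make the product inequality and the multi-step bound concrete, here is a small numerical sketch on a three-state space. The kernels `K_12`, `K_23`, `K_34` and the helper `compose` are hypothetical and chosen only for illustration; the code checks the inequalities on this one instance and proves nothing in general.

```python
from functools import reduce
from itertools import combinations


def dobrushin_delta(K):
    """delta(K) for a row-stochastic matrix: largest total variation distance between rows."""
    pairs = combinations(range(len(K)), 2)
    return max(
        (0.5 * sum(abs(a - b) for a, b in zip(K[x1], K[x2])) for x1, x2 in pairs),
        default=0.0,
    )


def compose(K1, K2):
    """Two-step kernel (K1 K2)(x, y) = sum_z K1(x, z) K2(z, y)."""
    size = len(K2[0])
    return [
        [sum(row[z] * K2[z][y] for z in range(len(K2))) for y in range(size)]
        for row in K1
    ]


# Hypothetical one-step kernels for the transitions 1 -> 2, 2 -> 3, and 3 -> 4.
K_12 = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
K_23 = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]
K_34 = [[0.4, 0.4, 0.2], [0.25, 0.5, 0.25], [0.3, 0.3, 0.4]]
kernels = [K_12, K_23, K_34]

alpha_n = min(1.0 - dobrushin_delta(K) for K in kernels)  # minimal ergodic coefficient
K_14 = reduce(compose, kernels)                            # multi-step kernel from time 1 to 4

print("delta(K_12 K_23)       :", dobrushin_delta(compose(K_12, K_23)))
print("delta(K_12) delta(K_23):", dobrushin_delta(K_12) * dobrushin_delta(K_23))
print("delta(K_14)            :", dobrushin_delta(K_14))
print("(1 - alpha_n)^3        :", (1.0 - alpha_n) ** 3)
```

In this instance the contraction coefficient of the composed kernel sits below the product of the one-step coefficients, and the three-step kernel sits below $(1 - \alpha_n)^{3}$, as (8) and (9) require.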


Dobrushin's coefficient can also be characterized by the action of the Markov
kernel on a natural function class. First, for any bounded measurable function
$h : \mathcal{X} \to \mathbb{R}$ we note that the operator

\[ (Kh)(x) = \int K(x, dz) h(z) \]

is well defined, and one also has that the oscillation of $h$

\[ \mathrm{Osc}(h) = \sup_{z_1, z_2 \in \mathcal{X}} | h(z_1) - h(z_2) | < \infty. \]

Now, if one sets $\mathcal{H} = \{h : \mathrm{Osc}(h) \le 1\}$, then the Dobrushin contraction coefficient
(2) has a second characterization,

\[ \delta(K) = \sup_{x_1, x_2 \in \mathcal{X},\; h \in \mathcal{H}} | (Kh)(x_1) - (Kh)(x_2) |. \]

This tells us in turn that for any Markov transition kernel $K$ on $\mathcal{X}$ and for any
bounded measurable function $h : \mathcal{X} \to \mathbb{R}$, one has the oscillation inequality

\[ \mathrm{Osc}(Kh) \le \delta(K) \, \mathrm{Osc}(h). \tag{10} \]

This bound is especially useful when it is applied to the multi-step kernel given
by $K^{(n)}_{i,j} = K^{(n)}_{i,i+1} K^{(n)}_{i+1,i+2} \cdots K^{(n)}_{j-1,j}$. In this case, the oscillation inequality (10)
and the upper bound (9) combine to give us

\[ \mathrm{Osc}(K^{(n)}_{i,j} h) \le \delta(K^{(n)}_{i,j}) \, \mathrm{Osc}(h) \le (1 - \alpha_n)^{j-i} \, \mathrm{Osc}(h). \tag{11} \]

This basic bound will be used many times in the analysis of Section 5.
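
The oscillation inequality and the second characterization of the contraction coefficient can also be illustrated on a finite state space. The sketch below uses a hypothetical three-state kernel `K` and test function `h`; on a finite space the supremum over functions of oscillation at most one is attained by indicator functions of subsets, which is what the brute-force search exploits.

```python
from itertools import combinations, product


def dobrushin_delta(K):
    """delta(K) for a row-stochastic matrix: largest total variation distance between rows."""
    pairs = combinations(range(len(K)), 2)
    return max(
        (0.5 * sum(abs(a - b) for a, b in zip(K[x1], K[x2])) for x1, x2 in pairs),
        default=0.0,
    )


def apply_kernel(K, h):
    """(K h)(x) = sum_z K(x, z) h(z) for a function h stored as a list."""
    return [sum(p * hz for p, hz in zip(row, h)) for row in K]


def osc(h):
    """Oscillation of a function on a finite state space."""
    return max(h) - min(h)


# Hypothetical kernel and test function, used only for illustration.
K = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
h = [2.0, -1.0, 0.5]

lhs = osc(apply_kernel(K, h))      # Osc(K h)
rhs = dobrushin_delta(K) * osc(h)  # delta(K) Osc(h)

# Largest oscillation of K applied to an indicator function of a subset of states.
best_indicator = max(
    osc(apply_kernel(K, list(ind))) for ind in product([0.0, 1.0], repeat=len(K))
)

print("Osc(Kh)        :", lhs)
print("delta(K) Osc(h):", rhs)
print("sup over indicators vs delta(K):", best_indicator, dobrushin_delta(K))
```

The indicator search recovers the contraction coefficient because, for an indicator of a set $B$, the oscillation of $Kh$ is exactly the largest value of $|K(x_1, B) - K(x_2, B)|$.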

4. Connecting a Martingale to $S_n$

Our proof of Theorem 1 exploits a martingale approximation like the one used by
Sethuraman and Varadhan (2005) in their proof of the Dobrushin central limit
theorem. Closely related plans have been used by Gordin (1969), Kipnis and Varadhan
(1986), Kifer (1998), Wu and Woodroofe (2004), Gordin and Peligrad (2011), and
Peligrad (2012), but prior to Sethuraman and Varadhan (2005) the martingale
approximation method seems to have been used only for stationary processes.

Here we only need a basic version of the CLT for an array of martingale difference
sequences (MDS) that we frame as a proposition. This version is easily covered by
any of the martingale central limit theorems of Brown (1971), McLeish (1974), or
Hall and Heyde (1980, Corollary 3.1).


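Proposition 5 (Basic CLT for MDS Arrays). If for each $n \ge 1$, one has a
martingale difference sequence $\{\xi_{n,i} : 1 \le i \le n\}$ with respect to the filtration
$\{\mathcal{G}_{n,i} : 0 \le i \le n\}$, and if one also has the negligibility condition

\[ \max_{1 \le i \le n} \| \xi_{n,i} \|_\infty \to 0 \quad \text{as } n \to \infty, \tag{12} \]

then the "weak law of large numbers" for the conditional variances

\[ \sum_{i=1}^{n} \mathbb{E}[\xi_{n,i}^2 \mid \mathcal{G}_{n,i-1}] \to 1 \quad \text{in probability as } n \to \infty \tag{13} \]

implies that one has convergence in distribution to a standard normal,

\[ \sum_{i=1}^{n} \xi_{n,i} \Rightarrow N(0, 1) \quad \text{as } n \to \infty. \]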

A Martingale for a Non-Homogenous Chain 

We let $\mathcal{F}_{n,0}$ be the trivial $\sigma$-field, and we set $\mathcal{F}_{n,i} = \sigma\{X_{n,1}, X_{n,2}, \ldots, X_{n,i}\}$ for
$1 \le i \le n + m$. Further, we define the value to-go process $\{V_{n,i} : m \le i \le n + m\}$
by setting $V_{n,n+m} = 0$ and by letting

\[ V_{n,i} = \sum_{j=i+1-m}^{n} \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}], \quad \text{for } m \le i < n + m. \tag{14} \]

If we view the random variable $Z_{n,j}$ as a reward that we receive at time $j$, then
the value to-go $V_{n,i}$ at time $i$ is the conditional expectation at time $i$ of the total of
the rewards that stand to be collected during the time interval $\{i + 1 - m, \ldots, n\}$.
For $1 + m \le i \le n + m$ we then let

\[ d_{n,i} = V_{n,i} - V_{n,i-1} + Z_{n,i-m}, \tag{15} \]

and one can check directly from the definition that $\{d_{n,i} : 1 + m \le i \le n + m\}$
is a martingale difference sequence (MDS) with respect to its natural filtration
$\{\mathcal{F}_{n,i} : 1 + m \le i \le n + m\}$.

When we sum the terms of (15), the summands $V_{n,i} - V_{n,i-1}$ telescope, and we
are left with the basic decomposition

\[ S_n = \sum_{i=1}^{n} Z_{n,i} = V_{n,m} + \sum_{i=1+m}^{n+m} d_{n,i}. \tag{16} \]

For the proof of Theorem 1, we assume without loss of generality that $\mathbb{E}[Z_{n,i}] = 0$
for all $1 \le i \le n$. Naturally, in this case we also have $\mathbb{E}[S_n] = \mathbb{E}[V_{n,m}] = 0$ since the
sum of the martingale differences in (16) will always have total expectation zero.
We now just need to analyze the components of the representation (16).
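
For readers who want the two checks behind this decomposition spelled out, the following worked equations give one way to verify them. This is a sketch that uses only the tower property of conditional expectation, the definition of the value to-go process, and the convention $V_{n,n+m} = 0$; it is not a transcription of the authors' argument.

```latex
% Martingale difference property, for 1 + m <= i <= n + m (tower property):
\begin{align*}
\mathbb{E}[V_{n,i} \mid \mathcal{F}_{n,i-1}]
  &= \sum_{j=i+1-m}^{n} \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i-1}]
   = V_{n,i-1} - \mathbb{E}[Z_{n,i-m} \mid \mathcal{F}_{n,i-1}], \\
\mathbb{E}[d_{n,i} \mid \mathcal{F}_{n,i-1}]
  &= \mathbb{E}[V_{n,i} \mid \mathcal{F}_{n,i-1}] - V_{n,i-1}
     + \mathbb{E}[Z_{n,i-m} \mid \mathcal{F}_{n,i-1}] = 0.
\end{align*}
% Telescoping sum, using V_{n,n+m} = 0:
\begin{align*}
\sum_{i=1+m}^{n+m} d_{n,i}
  = V_{n,n+m} - V_{n,m} + \sum_{i=1+m}^{n+m} Z_{n,i-m}
  = S_n - V_{n,m}.
\end{align*}
```

Note that $Z_{n,i-m}$ depends only on $X_{n,i-m}, \ldots, X_{n,i}$, so $d_{n,i}$ is $\mathcal{F}_{n,i}$-measurable, which is the remaining ingredient of the martingale difference property.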


5. Oscillation Estimates 


The first step in the proof of Theorem 1 is to argue that the summand $V_{n,m}$ in
(16) makes a contribution to $S_n$ that is asymptotically negligible when compared
to the standard deviation of $S_n$. Once this is done, one can use the martingale
CLT to deal with the last sum in (16). Both of these steps depend on oscillation
estimates that exploit the multiplicative bound (11) on the Dobrushin contraction
coefficient.

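For any random variable $X$ one has the trivial bound

\[ \mathrm{Osc}(X) = \operatorname{ess\,sup}(X) - \operatorname{ess\,inf}(X) \le 2 \| X \|_\infty, \tag{17} \]

together with its partial converse,

\[ \| X - \mathbb{E}[X] \|_\infty \le \mathrm{Osc}(X). \tag{18} \]

Moreover for any two $\sigma$-fields $\mathcal{I} \subset \mathcal{I}'$ of the Borel sets $\mathcal{B}(\mathcal{X})$, the conditional
expectation is a contraction for the oscillation semi-norm; that is, one has

\[ \mathrm{Osc}(\mathbb{E}[X \mid \mathcal{I}]) \le \mathrm{Osc}(\mathbb{E}[X \mid \mathcal{I}']) \le \mathrm{Osc}(X). \tag{19} \]

Also, by comparison of $X(\omega) Y(\omega)$ and $X(\omega') Y(\omega')$, one has the product rule

\[ \mathrm{Osc}(XY) \le \| X \|_\infty \, \mathrm{Osc}(Y) + \| Y \|_\infty \, \mathrm{Osc}(X). \tag{20} \]

In the next two lemmas we assume that there is a constant $C_n < \infty$ such that
$\| f_{n,i} \|_\infty \le C_n$ for all $1 \le i \le n$.

Since $Z_{n,i} = f_{n,i}(X_{n,i}, \ldots, X_{n,i+m})$ and $\mathbb{E}[Z_{n,i}] = 0$, this assumption gives us

\[ \| Z_{n,i} \|_\infty \le C_n, \quad \text{and} \quad \mathrm{Osc}(\mathbb{E}[Z_{n,i} \mid \mathcal{I}]) \le 2 C_n \tag{21} \]

for any $\sigma$-field $\mathcal{I} \subset \mathcal{B}(\mathcal{X})$.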

Oscillation Bounds on Conditional Moments 

Lemma 6 (Conditional Moments). For all $1 \le i < j \le n$ one has

\[ \| \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}] \|_\infty \le \mathrm{Osc}(\mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}]) \le 2 C_n (1 - \alpha_n)^{j-i} \tag{22} \]

and

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j}^2 \mid \mathcal{F}_{n,i}]) \le 2 C_n^2 (1 - \alpha_n)^{j-i}. \tag{23} \]

Proof. Since $\mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}]$ has mean zero, the first inequality of (22) is immediate
from (18). To get the second inequality, we note by the Markov property that we
can define a function $h_j$ on the support of $X_{n,j}$ by setting

\[ h_j(X_{n,j}) = \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,j}], \]

and by (21) we have the bound $\mathrm{Osc}(h_j) \le 2 C_n$. For $i < j$ a second use of the
Markov property gives us the pullback identity

\[ \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}] = (K^{(n)}_{i,j} h_j)(X_{n,i}), \]

so the bound (11) gives us

\[ \mathrm{Osc}(K^{(n)}_{i,j} h_j) \le 2 C_n (1 - \alpha_n)^{j-i}, \]

and this is all we need to complete the proof of (22).

One can prove (23) by essentially the same method, but now we define a map
$x \mapsto s_j(x)$ by setting

\[ s_j(X_{n,j}) = \mathbb{E}[Z_{n,j}^2 \mid \mathcal{F}_{n,j}], \]

so for $i < j$ the pullback identity becomes

\[ \mathbb{E}[Z_{n,j}^2 \mid \mathcal{F}_{n,i}] = (K^{(n)}_{i,j} s_j)(X_{n,i}). \]

By (19) we have $\mathrm{Osc}(s_j) \le \mathrm{Osc}(Z_{n,j}^2)$, so (21) implies $\mathrm{Osc}(s_j) \le 2 C_n^2$, and the
inequality (11) then gives us (23). □

Oscillation Bounds on Conditional Cross Moments 

The minimal ergodic coefficient $\alpha_n$ can also be used to control the oscillation of
the conditional expectations of the products $Z_{n,j} Z_{n,k}$ given $\mathcal{F}_{n,i}$. All of the
inequalities that we need tell a similar story, but the specific bounds have an inescapable
dependence on the relative values of $i$, $j$, $k$, $n$, and $m$. Figure 1 gives a graphical
representation of the constraints on the indices that feature in the next lemma.

Figure 1. Cross Moments Index Relations

The estimates in Lemma 7 require attention to certain ranges of indices. In turn, these amount to
a decomposition of the lattice triangle defined by the upper-left half of $\{1, 2, \ldots, n\} \times \{1, 2, \ldots, n\}$.

Lemma 7 (Conditional Cross Moments). For each $i \in \{m, \ldots, n + m\}$ we consider
$i - m < j \le n$ and $j < k \le n$. We then have the following oscillation bounds that
depend on the range of the indices (see also Figure 1):

Range 1. If $j \le i$ and $k \le j + m$ then

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 4 C_n^2. \tag{24} \]

Range 2. If $j \le i < j + m < k$ then

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 6 C_n^2 (1 - \alpha_n)^{k-j-m}. \tag{25} \]

Range 3. If $i < j < k \le j + m$ then

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 2 C_n^2 (1 - \alpha_n)^{j-i}. \tag{26} \]

Range 4. If $i < j < j + m < k$, then

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 6 C_n^2 (1 - \alpha_n)^{k-i-m}. \tag{27} \]

Proof. Inequality (24) follows immediately from the product rule (20) and the
bounds (21). To prove (25), we note that for $i < j + m$ we have $\mathcal{F}_{n,i} \subset \mathcal{F}_{n,j+m}$ so
from the monotonicity (19) and the fact that $Z_{n,j}$ is $\mathcal{F}_{n,j+m}$-measurable, we obtain
that

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,j+m}]) = \mathrm{Osc}(Z_{n,j} \mathbb{E}[Z_{n,k} \mid \mathcal{F}_{n,j+m}]). \]

The product rule (20) applied to the quantity on the right-hand side above gives
us the inequality

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le \| Z_{n,j} \|_\infty \, \mathrm{Osc}(\mathbb{E}[Z_{n,k} \mid \mathcal{F}_{n,j+m}]) + \mathrm{Osc}(Z_{n,j}) \, \| \mathbb{E}[Z_{n,k} \mid \mathcal{F}_{n,j+m}] \|_\infty, \]

so if we recall that $\| Z_{n,j} \|_\infty \le C_n$ and that $\mathrm{Osc}(Z_{n,j}) \le 2 C_n$ and use the conditional
moment bounds in (22) we have

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 2 C_n^2 (1 - \alpha_n)^{k-j-m} + 4 C_n^2 (1 - \alpha_n)^{k-j-m}, \]

completing the proof of (25).

To verify inequality (26), we consider the map $X_{n,j} \mapsto p_j(X_{n,j})$ given by

\[ p_j(X_{n,j}) = \mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,j}], \]

and we note that for $i < j$ we have the pullback identity

\[ \mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}] = (K^{(n)}_{i,j} p_j)(X_{n,i}). \]

Since $\| Z_{n,j} \|_\infty$ and $\| Z_{n,k} \|_\infty$ are bounded by $C_n$, we have $\| p_j \|_\infty \le C_n^2$ and
$\mathrm{Osc}(p_j) \le 2 C_n^2$. We also have $i < j < k$ so (11) tells us that

\[ \mathrm{Osc}(K^{(n)}_{i,j} p_j) \le \delta(K^{(n)}_{i,j}) \, \mathrm{Osc}(p_j) \le 2 C_n^2 (1 - \alpha_n)^{j-i}, \]

completing the proof of (26).

Finally, for the last inequality (27) we have $j < j + m < k$, we consider the map
$X_{n,j} \mapsto q_j(X_{n,j})$ defined by setting

\[ q_j(X_{n,j}) = \mathbb{E}[Z_{n,j} \, \mathbb{E}[Z_{n,k} \mid \mathcal{F}_{n,j+m}] \mid \mathcal{F}_{n,j}], \]

and we obtain the identity

\[ \mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}] = (K^{(n)}_{i,j} q_j)(X_{n,i}). \]

By the multiplicative bound (11), this gives us

\[ \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le (1 - \alpha_n)^{j-i} \, \mathrm{Osc}(q_j), \]

and we also have $\mathrm{Osc}(q_j) \le 6 C_n^2 (1 - \alpha_n)^{k-j-m}$ by (25), so the proof of (27) is also
complete. □


6. The Value To-Go Process and MDS $L^\infty$-Bounds

We have everything we need to argue that the variance condition (4) implies
the negligibility condition (12). The first step is to get simple $L^\infty$-estimates of the
value to-go $V_{n,i}$ that was defined in (14). We then need estimates of the martingale
differences $d_{n,i}$ defined in (15). Here, and subsequently, we use $M = M(m)$ to denote
a Hardy-style constant which depends only on $m$ and which may change from one
line to the next.

Lemma 8 ($L^\infty$-Bounds for the Value To-Go and for the MDS). There is a constant
$M < \infty$ such that for all $n \ge 1$ we have

\[ \| V_{n,i} \|_\infty \le M C_n \alpha_n^{-1}, \quad \text{for } m \le i \le n + m, \text{ and} \tag{28} \]

\[ \| d_{n,i} \|_\infty \le M C_n \alpha_n^{-1}, \quad \text{for } 1 + m \le i \le n + m. \tag{29} \]

Proof. We have $\| Z_{n,j} \|_\infty \le C_n$, and when we use this estimate on the first $m$
summands in the definition (14) of the value to-go $V_{n,i}$ we get the bound

\[ \| V_{n,i} \|_\infty \le m C_n + \sum_{j=i+1}^{n} \| \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}] \|_\infty. \]

From (22) we know that $\| \mathbb{E}[Z_{n,j} \mid \mathcal{F}_{n,i}] \|_\infty \le 2 C_n (1 - \alpha_n)^{j-i}$ for all $1 \le i < j \le n$
so, after completing the geometric series, we have

\[ \| V_{n,i} \|_\infty \le m C_n + 2 C_n \alpha_n^{-1} \le M C_n \alpha_n^{-1}, \]

where one can take $M = 2 + m$ as a generous choice for $M$. This bound, the
representation (15), and the triangle inequality then give us (29). □
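
The step "after completing the geometric series" can be written out explicitly. The display below is a sketch under the assumption $0 < \alpha_n \le 1$ (when $\alpha_n = 0$ the bounds of the lemma are vacuous); the explicit constants $m + 2$ and $2m + 5$ are one admissible choice worked out here, not constants quoted from the paper.

```latex
\begin{align*}
\sum_{j=i+1}^{n} 2 C_n (1 - \alpha_n)^{j-i}
  &\le 2 C_n \sum_{\ell=1}^{\infty} (1 - \alpha_n)^{\ell}
   = 2 C_n \, \frac{1 - \alpha_n}{\alpha_n}
   \le 2 C_n \alpha_n^{-1}, \\
\| V_{n,i} \|_\infty
  &\le m C_n + 2 C_n \alpha_n^{-1}
   \le (m + 2) \, C_n \alpha_n^{-1}
   \qquad \text{since } \alpha_n \le 1, \\
\| d_{n,i} \|_\infty
  &\le \| V_{n,i} \|_\infty + \| V_{n,i-1} \|_\infty + \| Z_{n,i-m} \|_\infty
   \le (2m + 5) \, C_n \alpha_n^{-1}.
\end{align*}
```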

Conditional Variances $L^2$-Bounds

Everything is also in place to show that the variance condition (4) gives one the
weak law of large numbers for the conditional variances (13). We begin by deriving
some basic inequalities for the variance of $S_n$.

Lemma 9 (Variance Bounds). For all $n \ge 1$ we have

\[ \mathbb{E}[S_n^2] = \mathbb{E}[V_{n,m}^2] + \sum_{j=1+m}^{n+m} \mathbb{E}[d_{n,j}^2] \quad \text{and} \tag{30} \]

\[ \mathrm{Var}[S_n] - M C_n^2 \alpha_n^{-2} \le \sum_{j=1+m}^{n+m} \mathbb{E}[d_{n,j}^2] \le \mathrm{Var}[S_n]. \tag{31} \]

Proof. When we square both sides of (16) we have

\[ S_n^2 = V_{n,m}^2 + 2 V_{n,m} \Big\{ \sum_{j=1+m}^{n+m} d_{n,j} \Big\} + \Big\{ \sum_{j=1+m}^{n+m} d_{n,j} \Big\}^2. \]

Since $V_{n,m}$ is $\mathcal{F}_{n,m}$-measurable, we obtain from the conditional orthogonality of the
martingale differences that

\[ \mathbb{E}[S_n^2 \mid \mathcal{F}_{n,m}] = V_{n,m}^2 + \sum_{j=1+m}^{n+m} \mathbb{E}[d_{n,j}^2 \mid \mathcal{F}_{n,m}], \]

and, when we take the total expectation, we then get (30). Finally, since $\mathbb{E}[S_n] = 0$,
the representation (30) and the bound (28) for $\| V_{n,m} \|_\infty$ give us the two inequalities
of (31). □

Lemma 10 (Oscillation Bound). There is a constant $M < \infty$ such that

\[ \mathrm{Osc}\Big( \sum_{j=i+1}^{n+m} \mathbb{E}[d_{n,j}^2 \mid \mathcal{F}_{n,i}] \Big) \le M C_n^2 \alpha_n^{-2} \quad \text{for } m \le i < n + m. \tag{32} \]

Proof. If we sum the identity (15) we have

\[ \sum_{j=1+i}^{n+m} Z_{n,j-m} = V_{n,i} + \sum_{j=1+i}^{n+m} d_{n,j}, \]

so, when we square both sides and use the fact that $V_{n,i}$ is $\mathcal{F}_{n,i}$-measurable, the
orthogonality of the martingale differences gives us

\[ \mathbb{E}\Big[ \Big\{ \sum_{j=1+i}^{n+m} Z_{n,j-m} \Big\}^2 \,\Big|\, \mathcal{F}_{n,i} \Big] = V_{n,i}^2 + \sum_{j=i+1}^{n+m} \mathbb{E}[d_{n,j}^2 \mid \mathcal{F}_{n,i}]. \]

The triangle inequality then implies

\[ \mathrm{Osc}\Big( \sum_{j=i+1}^{n+m} \mathbb{E}[d_{n,j}^2 \mid \mathcal{F}_{n,i}] \Big) \le \mathrm{Osc}(V_{n,i}^2) + \mathrm{Osc}\Big( \mathbb{E}\Big[ \Big\{ \sum_{j=1+i}^{n+m} Z_{n,j-m} \Big\}^2 \,\Big|\, \mathcal{F}_{n,i} \Big] \Big). \tag{33} \]

By (28) we have $\| V_{n,i} \|_\infty \le M C_n \alpha_n^{-1}$ so, by (17), we obtain

\[ \mathrm{Osc}(V_{n,i}^2) \le 2 \| V_{n,i}^2 \|_\infty \le M C_n^2 \alpha_n^{-2}. \tag{34} \]

It only remains to estimate the second summand of (33), but this takes some
work. Specifically, we will check that one can write

\[ \mathrm{Osc}\Big( \mathbb{E}\Big[ \Big\{ \sum_{j=1+i-m}^{n} Z_{n,j} \Big\}^2 \,\Big|\, \mathcal{F}_{n,i} \Big] \Big) \le S_0 + S_1 + S_2 + S_3 + S_4, \tag{35} \]

where $S_0$, $S_1$, $S_2$, $S_3$, and $S_4$ are non-negative sums that one can estimate
individually with help from our oscillation bounds. Here the first term $S_0$ accounts for the
oscillation of the conditional squared moments. It is given by

\[ S_0 = \sum_{j=1+i-m}^{i} \mathrm{Osc}(\mathbb{E}[Z_{n,j}^2 \mid \mathcal{F}_{n,i}]) + \sum_{j=1+i}^{n} \mathrm{Osc}(\mathbb{E}[Z_{n,j}^2 \mid \mathcal{F}_{n,i}]), \]

and by (21) and (23) we have the estimate

\[ S_0 \le 2 m C_n^2 + 2 C_n^2 \sum_{j=1+i}^{n} (1 - \alpha_n)^{j-i} \le 2 (1 + m) C_n^2 \alpha_n^{-1}. \]

The remaining sums $S_1$, $S_2$, $S_3$ and $S_4$ are given by the oscillation of the
conditional cross moments $Z_{n,j} Z_{n,k}$ given $\mathcal{F}_{n,i}$ where the ranges of the indices $j$ and $k$
are given by the corresponding four regions in Figure 1. Specifically, we have

\[ S_1 = 2 \sum_{j=1+i-m}^{i} \sum_{k=1+j}^{j+m} \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]), \]

and (24) gives us $S_1 \le 8 m^2 C_n^2$ since $S_1$ has $m^2$ summands. Next, if we set

\[ S_2 = 2 \sum_{j=1+i-m}^{i} \sum_{k=1+j+m}^{n} \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \]

then the oscillation inequality (25) gives us

\[ S_2 \le 12 C_n^2 \sum_{j=1+i-m}^{i} \sum_{k=1+j+m}^{n} (1 - \alpha_n)^{k-j-m} \le 12 m C_n^2 \alpha_n^{-1}. \]

Similarly, for the third region, the bound (26) gives us

\[ S_3 = 2 \sum_{j=1+i}^{n} \sum_{k=1+j}^{j+m} \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 4 C_n^2 \sum_{j=1+i}^{n} \sum_{k=1+j}^{j+m} (1 - \alpha_n)^{j-i} \le 4 m C_n^2 \alpha_n^{-1}, \]

and, for the fourth region, the bound (27) implies

\[ S_4 = 2 \sum_{j=1+i}^{n} \sum_{k=1+j+m}^{n} \mathrm{Osc}(\mathbb{E}[Z_{n,j} Z_{n,k} \mid \mathcal{F}_{n,i}]) \le 12 C_n^2 \sum_{j=1+i}^{n} \sum_{k=1+j+m}^{n} (1 - \alpha_n)^{k-i-m} \le 12 C_n^2 \alpha_n^{-2}. \]

Finally, by our decomposition (35), the upper bounds for $S_0$, $S_1$, $S_2$, $S_3$, and $S_4$ tell
us that there is a constant $M$ for which we have

\[ \mathrm{Osc}\Big( \mathbb{E}\Big[ \Big\{ \sum_{j=1+i-m}^{n} Z_{n,j} \Big\}^2 \,\Big|\, \mathcal{F}_{n,i} \Big] \Big) \le M C_n^2 \alpha_n^{-2}, \]

so, given (33) and (34), the proof of the lemma is complete. □
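
The double geometric sum behind the bound for the fourth region, and the way the five estimates combine into a single constant, can be made explicit. The following is a sketch assuming $0 < \alpha_n \le 1$ and taking the third-region bound in the form $S_3 \le 4 m C_n^2 \alpha_n^{-1}$; the resulting constant $8m^2 + 18m + 14$ is computed here for illustration and is not claimed to be the constant intended by the authors.

```latex
% Substitute a = j - i >= 1 and b = k - j - m >= 1, so that k - i - m = a + b:
\begin{align*}
\sum_{j=1+i}^{n} \sum_{k=1+j+m}^{n} (1 - \alpha_n)^{k-i-m}
  \le \sum_{a=1}^{\infty} \sum_{b=1}^{\infty} (1 - \alpha_n)^{a+b}
  = \Big( \frac{1 - \alpha_n}{\alpha_n} \Big)^{2}
  \le \alpha_n^{-2}.
\end{align*}
% Collecting the five bounds and using 1 <= \alpha_n^{-1} <= \alpha_n^{-2}:
\begin{align*}
S_0 + S_1 + S_2 + S_3 + S_4
  &\le \big( 2(1 + m) + 8 m^2 + 12 m + 4 m + 12 \big) \, C_n^2 \alpha_n^{-2} \\
  &= \big( 8 m^2 + 18 m + 14 \big) \, C_n^2 \alpha_n^{-2}.
\end{align*}
```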

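7. Completion of the Proof of Theorem 1

It only remains to argue that if we set

\[ \eta_i = \mathbb{E}[d_{n,i}^2 \mid \mathcal{F}_{n,i-1}] \quad \text{and} \quad \Delta_n = \sum_{i=1+m}^{n+m} (\eta_i - \mathbb{E}[\eta_i]), \]

then the variance condition (4) implies that $\Delta_n = o(\mathrm{Var}[S_n])$ in probability as
$n \to \infty$. We can get this as an easy consequence of the next lemma.

Lemma 11 ($L^2$-Bound for $\Delta_n$). There is a constant $M < \infty$ depending only on
$m$ such that for all $n \ge 1$ one has the inequality

\[ \mathbb{E}[\Delta_n^2] = \mathrm{Var}\Big[ \sum_{i=1+m}^{n+m} \mathbb{E}[d_{n,i}^2 \mid \mathcal{F}_{n,i-1}] \Big] \le M C_n^2 \alpha_n^{-2} \, \mathrm{Var}[S_n]. \]
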

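Proof. By direct expansion we have

$$E[\Delta_n^2] = \sum_{i=1+m}^{n+m} \mathrm{Var}[\eta_i] + 2\sum_{i=1+m}^{n+m} E\Big[(\eta_i - E[\eta_i])\Big\{\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j])\Big\}\Big], \tag{36}$$

and we estimate the two sums separately. First, by crude bounds and (29) we have

$$\mathrm{Var}[\eta_i] \le \|\eta_i\|_\infty\, E[\eta_i] \le \|d_{n,i}\|_\infty^2\, E[\eta_i] \le MC_n^2\alpha_n^{-2}\, E[\eta_i],$$

so we obtain that the first sum of (36) satisfies the inequality

$$\sum_{i=1+m}^{n+m} \mathrm{Var}[\eta_i] \le MC_n^2\alpha_n^{-2} \sum_{i=1+m}^{n+m} E[\eta_i].$$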
The twin bounds of (31) and the definition $\eta_i = E[d_{n,i}^2 \mid \mathcal{F}_{n,i-1}]$ then tell us that

$$\mathrm{Var}[S_n] - MC_n^2\alpha_n^{-2} \le \sum_{i=1+m}^{n+m} E[\eta_i] \le \mathrm{Var}[S_n], \tag{37}$$

so we also have the upper bound

$$\sum_{i=1+m}^{n+m} \mathrm{Var}[\eta_i] \le MC_n^2\alpha_n^{-2}\,\mathrm{Var}[S_n]. \tag{38}$$

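To estimate the second sum of (36), we first note that $\eta_i$ is $\mathcal{F}_{n,i-1}$-measurable and $\mathcal{F}_{n,i-1} \subseteq \mathcal{F}_{n,i}$, so, if we condition on $\mathcal{F}_{n,i}$ we have

$$E\Big[(\eta_i - E[\eta_i])\Big\{\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j])\Big\}\Big] = E\Big[(\eta_i - E[\eta_i])\, E\Big[\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j]) \,\Big|\, \mathcal{F}_{n,i}\Big]\Big]. \tag{39}$$

The definition of $\eta_j$ tells us that $\eta_j - E[\eta_j] = E[d_{n,j}^2 \mid \mathcal{F}_{n,j-1}] - E[d_{n,j}^2]$ so, because $\mathcal{F}_{n,i} \subseteq \mathcal{F}_{n,j-1}$ for all $i < j$, one then has

$$E\Big[\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j]) \,\Big|\, \mathcal{F}_{n,i}\Big] = \sum_{j=i+1}^{n+m} \big\{E[d_{n,j}^2 \mid \mathcal{F}_{n,i}] - E[d_{n,j}^2]\big\}.$$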

These summands have mean zero, so the bound (18) and the oscillation inequality give us

$$\Big\| E\Big[\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j]) \,\Big|\, \mathcal{F}_{n,i}\Big] \Big\|_\infty \le MC_n^2\alpha_n^{-2}.$$

When we use this estimate in (39), we see from the non-negativity of $\eta_i$ and the triangle inequality that

$$E\Big[(\eta_i - E[\eta_i])\Big\{\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j])\Big\}\Big] \le MC_n^2\alpha_n^{-2}\, E[\eta_i],$$

so, after summing over $i \in \{1+m, \ldots, n+m\}$ and recalling the second inequality of (37) we obtain

$$\sum_{i=1+m}^{n+m} E\Big[(\eta_i - E[\eta_i])\Big\{\sum_{j=i+1}^{n+m} (\eta_j - E[\eta_j])\Big\}\Big] \le MC_n^2\alpha_n^{-2}\,\mathrm{Var}[S_n]. \tag{40}$$

By (36), the bounds (38) and (40) complete the proof of the lemma. □

Now, at last, we can use the basic decomposition (usd to write


in,i+m || Vn,m IIqq ^ 


( 4 i) , = = y ■ _, „. ,_ 

+Var[5„] i/VarfS'n] +Var[5 n ]' 

and it only remains to apply our lemmas. First, from our hypothesis that $C_n^2\alpha_n^{-2} = o(\mathrm{Var}[S_n])$ as $n \to \infty$, we see that the $L^\infty$-bound $\|d_{n,i}\|_\infty \le MC_n\alpha_n^{-1}$ in Lemma 8 implies the asymptotic negligibility of the scaled differences $d_{n,i+m}/\sqrt{\mathrm{Var}[S_n]}$, $1 \le i \le n$. Second, our hypothesis and the variance bounds imply the asymptotic equivalence

$$\mathrm{Var}[S_n] \sim \sum_{i=1}^{n} E[d_{n,i+m}^2] \quad \text{as } n \to \infty,$$
so the $L^2$-inequality in Lemma 11 tells us that the weak law (Thill also holds for the scaled martingale differences.

Taken together, these two observations imply that the first sum on the right-hand side of (41) converges in distribution to a standard normal. Moreover, because of the $L^\infty$-bound $\|V_{n,m}\|_\infty \le MC_n\alpha_n^{-1}$ given by (28), the last term in (41) is asymptotically negligible. In turn, these observations tell us that

$$\frac{S_n - E[S_n]}{\sqrt{\mathrm{Var}[S_n]}} \Longrightarrow N(0,1) \quad \text{as } n \to \infty,$$

and the proof of Theorem 1 is complete.


8. Dynamic Inventory Management: A Leading Example 


We now consider a classic dynamic inventory management problem where one has $n$ periods and $n$ independent demands $D_1, D_2, \ldots, D_n$. We assume that demands all have the same density $\psi$, and that this density has support on a bounded interval contained in $[0, \infty)$.

In each period $1 \le i \le n$ one knows the current level of inventory $x$, and the task is to decide the level of inventory $y \ge x$ that one wants to hold after an order is placed and fulfilled. Here it is also useful to allow for $x$ to be negative, and, in that case, $|x|$ would represent the level of backlogged demand. To stay mindful of this possibility, we sometimes call $x$ the generalized inventory level.

We further assume that orders are fulfilled instantaneously at a cost that is proportional to the ordered quantity; so, for example, to move the inventory level from $x$ to $y \ge x$, one places an order of size $y - x$ and incurs a purchase cost equal to $c(y - x)$ where the multiplicative constant $c$ is a parameter of the model.

The model also takes into account the cost of either holding physical inventory or of managing a backlog. Specifically, if the current generalized inventory is equal to $x$, then the firm incurs additional carrying costs that are given by

$$L(x) = \begin{cases} c_h x & \text{if } x \ge 0 \\ -c_p x & \text{if } x < 0. \end{cases}$$

In other words, if $x \ge 0$, then $L(x)$ represents the cost for holding a quantity $x$ of inventory from one period to the next, and, if $x < 0$, then $L(x)$ represents the penalty cost for managing a quantity $-x > 0$ of unmet demand.

Here we also assume that all unmet demand can be successfully backlogged, so customers in one period whose demand is incompletely met will return in successive periods until either their demand has been met or until the decision period $n$ is completed. If there is still unmet demand at time $n$, then that demand is lost. Finally, we assume that the purchase cost rate $c$ is strictly smaller than the penalty rate $c_p$, so it is never optimal to accrue penalty costs when one can place an order. Naturally, the manager's objective is to minimize the total expected inventory costs over the decision periods $1, 2, \ldots, n$.

This problem has been widely studied, and, at this point, its formulation as a dynamic program is well understood; cf. Bellman et al. (1955), Bulinskaya (1964), or Porteus (2002, Section 4.2). Specifically, if we let $v_k(x)$ denote the minimal expected inventory cost when there are $k$ time periods remaining and when $x$ is the current generalized inventory level, then dynamic programming gives us the backwards recursion

$$v_k(x) = \min_{y \ge x}\big\{c(y - x) + E[L(y - D_{n-k+1})] + E[v_{k-1}(y - D_{n-k+1})]\big\}, \tag{42}$$

for $1 \le k \le n$, and one computes $v_k(x)$ by iteration beginning with $v_0(x) = 0$.

For this model, it is also well known that there is a base-stock policy that is optimal; specifically, there are non-decreasing values

$$s_1 \le s_2 \le \cdots \le s_n \tag{43}$$

such that if the current time is $i$ and the current inventory is $x$, then the optimal level $\gamma_{n,i}(x)$ at time $i$ for the inventory after restocking is given by

$$\gamma_{n,i}(x) = \begin{cases} s_{n-i+1} & \text{if } x \le s_{n-i+1} \\ x & \text{if } x > s_{n-i+1}. \end{cases} \tag{44}$$

In other words, if at time $i$ the inventory level is below $s_{n-i+1}$ then the optimal action is to place an order of size $s_{n-i+1} - x$, but if the inventory level is $s_{n-i+1}$ or higher, then the optimal action is to order nothing. Moreover, Bulinskaya (1964, Theorem 1) also showed that for demands with density $\psi$ and cumulative distribution function $\Psi$, one has for $n \ge 2$ that

$$s_1 = \Psi^{-1}\Big(\frac{c_p - c}{c_h + c_p}\Big) \quad\text{and}\quad s_n \le s_\infty = \Psi^{-1}\Big(\frac{c_p}{c_h + c_p}\Big). \tag{45}$$

These relations will be important for us later.
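As a rough numerical illustration of the recursion (42) and the base-stock levels (43), the following sketch runs the backward iteration on an integer grid. The discrete demand distribution, the grid bound `y_max`, and the function name `solve_inventory_dp` are assumptions made only for this example; the paper itself works with a demand density.

```python
def solve_inventory_dp(n, c, c_h, c_p, demand_pmf, y_max):
    """Backward recursion (42) on an integer grid (illustrative sketch).

    demand_pmf maps integer demands d >= 0 to probabilities; it is an assumed
    discrete stand-in for the density psi. Returns ([s_1, ..., s_n], v_n),
    where v_n maps an inventory level x to the minimal expected cost.
    """
    J = max(demand_pmf)                      # largest possible demand

    def carrying(z):                         # L(z): holding or backlog cost
        return c_h * z if z >= 0 else -c_p * z

    lo = -J * n                              # lowest level where v_0 is needed
    v_prev = {x: 0.0 for x in range(lo, y_max + 1)}    # v_0 = 0
    levels = []
    for k in range(1, n + 1):
        lo += J                              # v_k is only needed on a shorter range
        # g[y] = c*y + E[L(y - D)] + E[v_{k-1}(y - D)], so v_k(x) = min_{y >= x} g[y] - c*x
        g = {
            y: c * y + sum(p * (carrying(y - d) + v_prev[y - d])
                           for d, p in demand_pmf.items())
            for y in range(lo, y_max + 1)
        }
        levels.append(min(g, key=lambda y: (g[y], y)))  # smallest minimizer, s_k
        v_prev = {x: min(g[y] for y in range(x, y_max + 1)) - c * x
                  for x in range(lo, y_max + 1)}
    return levels, v_prev


# Assumed example: demand uniform on {0, ..., 9}, c = 1, c_h = 2, c_p = 5.
demand = {d: 0.1 for d in range(10)}
levels, v_n = solve_inventory_dp(n=3, c=1.0, c_h=2.0, c_p=5.0,
                                 demand_pmf=demand, y_max=12)
```

In this discrete setting the first level `levels[0]` is the smallest grid point $y$ with $\Psi(y) \ge (c_p - c)/(c_h + c_p)$, which is the grid analogue of the formula for $s_1$ in (45).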


A CLT for Optimally Managed Inventory Costs 

To begin, we take the generalized inventory at the beginning of period $i = 1$ (before any order is placed) to be $X_{n,1} = x$, where $x$ can be any element of the interval $[-s_\infty, s_\infty]$. Subsequently we take $X_{n,i}$ to be the generalized inventory at the beginning of period $i \in \{2, 3, \ldots, n\}$; so, in view of the base-stock policy (44), we have the stochastic recursion

$$X_{n,i+1} = \gamma_{n,i}(X_{n,i}) - D_i \quad \text{for all } 1 \le i \le n. \tag{46}$$

The key point here is that $\{X_{n,i} : 1 \le i \le n+1\}$ is a temporally non-homogenous Markov chain. Moreover, if the support of the demand density $\psi$ is contained in $[0, J]$ with $0 < J < \infty$ and if $s_1$ and $s_\infty$ are given by (45), then by the recursion (46) we can choose the state space $\mathcal{X}$ of this chain so that

$$\mathcal{X} \subseteq [-J, s_\infty]. \tag{47}$$

Now, if $\pi_n^*$ is the policy that minimizes the total expected inventory cost that is incurred over $n$ decision periods, then the total cost that is realized when one follows the policy $\pi_n^*$ is given by

$$C_n(\pi_n^*) = \sum_{i=1}^{n} \big\{c(\gamma_{n,i}(X_{n,i}) - X_{n,i}) + L(X_{n,i+1})\big\}, \tag{48}$$

and we see that the total inventory cost $C_n(\pi_n^*)$ is a special case of the sum (1). To spell out the correspondence, we first take $m = 1$, and then we take

$$f_{n,i}(x, y) = c(\gamma_{n,i}(x) - x) + L(y), \quad \text{for } 1 \le i \le n,$$

so finally (46) gives us the driving Markov chain.
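To make the correspondence concrete, here is a minimal sketch that runs the recursion (46) under a base-stock policy of the form (44) and accumulates the realized cost (48) along one demand path. The function name `realized_cost` and the convention that `levels[k - 1]` stores $s_k$ are assumptions of this example.

```python
def realized_cost(x0, levels, demands, c, c_h, c_p):
    """Realized cost (48) along one demand path (illustrative sketch).

    levels[k - 1] is the base-stock level s_k used when k periods remain;
    demands[i - 1] is the demand D_i. Returns (total cost, inventory path).
    """
    n = len(demands)
    x, total, path = x0, 0.0, [x0]
    for i in range(1, n + 1):
        s = levels[n - i]                    # s_{n-i+1}: level used at time i
        y = max(x, s)                        # gamma_{n,i}(x): inventory after restocking
        x_next = y - demands[i - 1]          # recursion (46)
        carrying = c_h * x_next if x_next >= 0 else -c_p * x_next
        total += c * (y - x) + carrying      # one-period cost f_{n,i}(x, x_next)
        x = x_next
        path.append(x)
    return total, path


# Two periods, s_1 = 2, s_2 = 3, demands 1 and 4, starting from empty inventory.
total, path = realized_cost(0, [2, 3], [1, 4], c=1.0, c_h=1.0, c_p=2.0)
```

Each summand depends on the current state and on the next state only, which is the case $m = 1$ of the look-ahead structure described above.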
Theorem 1 and Corollary 2 now give us a natural path to a central limit theorem for the realized optimal inventory cost. We only need to isolate a mild regularity condition on the density function $\psi$ of the demand distribution $\Psi$.

Definition 12 (Typical Class). We say that a probability density function $\psi$ is in the typical class if for each $\epsilon > 0$ there is a $\bar{w} = \bar{w}(\epsilon)$ such that

$$\psi(w) - \psi(w + \epsilon) \le 0 \ \text{ for all } w \le \bar{w}, \quad\text{and}\quad \psi(w) - \psi(w + \epsilon) \ge 0 \ \text{ for all } w \ge \bar{w}.$$

Densities in the typical class include the uniform density on $[0, J]$, the beta$(\alpha, \beta)$ density with $\alpha > 1$ and $\beta > 1$, the exponential densities, and the gamma densities. For an example of a density that is not in the typical class, one can take any density with two separated modes. Such multi-modal densities are seldom used in demand models.


Theorem 13 (CLT for Mean-Optimal Inventory Cost). If the demand density $\psi$ is in the typical class and if $\psi$ has bounded support, then the inventory cost realized under the mean-optimal policy $\pi_n^*$ obeys the asymptotic normal law

$$\frac{C_n(\pi_n^*) - E[C_n(\pi_n^*)]}{\sqrt{\mathrm{Var}[C_n(\pi_n^*)]}} \Longrightarrow N(0,1) \quad \text{as } n \to \infty.$$


The one-period cost functions in the sum (48) are uniformly bounded because of the inclusion (47) and $0 < J < \infty$, so two steps are needed to extract this result from Theorem 1. First we show that the minimal ergodic coefficient of the Markov chain (46) is bounded away from zero. Second, we show that the variance of $C_n(\pi_n^*)$ goes to infinity as $n \to \infty$.

After we complete the proof of Theorem 13 we have two observations. The first explains why one cannot prove Theorem 13 by the device of state space extension and direct invocation of Dobrushin's theorem. In a nutshell, the issue is that if one extends the state space then the coefficient of ergodicity can become degenerate. The second observation highlights how one still has the conclusion of Theorem 13 even for models where there is no immediate fulfillment of placed orders.


A Uniform Lower Bound for the Minimal Ergodic Coefficients 

To establish a uniform lower bound for the minimal ergodic coefficients of the Markov chain (46), we begin with a general lemma which explains the role of the class of typical densities.

Lemma 14 (Total Variation Distance Bound). If the density $\psi$ of $D_1$ is in the typical class, then for $\epsilon = |\gamma_{n,i}(x') - \gamma_{n,i}(x)|$ one has

$$\sup_{B \in \mathcal{B}(\mathcal{X})} \big| K^{(n)}_{i,i+1}(x', B) - K^{(n)}_{i,i+1}(x, B) \big| = P(\bar{w} \le D_1 \le \bar{w} + \epsilon), \tag{49}$$

where $\bar{w} = \bar{w}(\epsilon)$ is the value guaranteed by Definition 12.

Proof. Given $x \in \mathcal{X}$ and a Borel set $B \subseteq \mathcal{X}$, we introduce the Borel set

$$B_x = \gamma_{n,i}(x) - B,$$

so the transition kernel of the Markov chain (46) can be written as

$$K^{(n)}_{i,i+1}(x, B) = P(X_{n,i+1} \in B \mid X_{n,i} = x) = P(D_1 \in B_x) = \int_{B_x} \psi(w)\, dw.$$

Without loss of generality we can assume that $x \le x'$, so the restocking formula (44) gives us $\gamma_{n,i}(x) \le \gamma_{n,i}(x')$, and for $\epsilon = \gamma_{n,i}(x') - \gamma_{n,i}(x) \ge 0$ we find

$$K^{(n)}_{i,i+1}(x', B) = P(X_{n,i+1} \in B \mid X_{n,i} = x') = P(D_1 - \epsilon \in B_x) = \int_{B_x} \psi(w + \epsilon)\, dw.$$

The absolute difference in (49) is then given by

$$\big| K^{(n)}_{i,i+1}(x', B) - K^{(n)}_{i,i+1}(x, B) \big| = \Big| \int_{B_x} \psi(w)\, dw - \int_{B_x} \psi(w + \epsilon)\, dw \Big|,$$

and the supremum is attained at $B^* = \{w : \psi(w) > \psi(w + \epsilon)\}$. Because $\psi$ belongs to the typical class, Definition 12 tells us that the integrals over $B^*$ are equal to the corresponding integrals over $[\bar{w}, \infty)$. Hence, we have

$$\sup_{B \in \mathcal{B}(\mathcal{X})} \big| K^{(n)}_{i,i+1}(x', B) - K^{(n)}_{i,i+1}(x, B) \big| = \int_{\bar{w}}^{\infty} \{\psi(x) - \psi(x + \epsilon)\}\, dx = P(D_1 \ge \bar{w}) - P(D_1 - \epsilon \ge \bar{w}) = P(\bar{w} \le D_1 \le \bar{w} + \epsilon),$$

just as needed. □
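The identity (49) can be checked numerically for specific densities. The sketch below approximates the supremum over Borel sets by integrating the positive part of $\psi(w) - \psi(w + \epsilon)$ with a midpoint rule; the helper names, the step size, and the two test densities are assumptions chosen only for illustration.

```python
import math


def tv_between_shifts(density, eps, lo, hi, h=1e-3):
    """Midpoint-rule value of the integral of (psi(w) - psi(w + eps))^+ over [lo, hi].

    Since psi and its shift both integrate to one, this equals the supremum
    over Borel sets B of | int_B psi(w) dw - int_B psi(w + eps) dw |.
    """
    steps = int(round((hi - lo) / h))
    total = 0.0
    for k in range(steps):
        w = lo + (k + 0.5) * h               # midpoint of the k-th cell
        total += max(density(w) - density(w + eps), 0.0) * h
    return total


def exponential_density(w):                  # rate-one exponential, w_bar = 0
    return math.exp(-w) if w >= 0 else 0.0


def uniform_density(w):                      # uniform on [0, 1]
    return 1.0 if 0.0 <= w <= 1.0 else 0.0


# Right-hand side of (49): P(0 <= D <= eps) = 1 - exp(-eps) for the exponential,
# and eps for the uniform density on [0, 1] when eps <= 1.
tv_exp = tv_between_shifts(exponential_density, 0.5, -0.5, 40.0)
tv_unif = tv_between_shifts(uniform_density, 0.25, -0.25, 1.0)
```

For the exponential density the crossing point is $\bar{w} = 0$, so the right-hand side of (49) is $1 - e^{-\epsilon}$; for the uniform density on $[0,1]$ it is $\epsilon$.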


Lemma 14 can be generalized to accommodate multi-modal densities, but since such densities are seldom used as models for demand distributions, the simple formulation given here covers all the models one is likely to meet in practice. Moreover, the definitions of $s_1$ and $s_\infty$ given by (45) now give us just what we need to make good use of our basic bound (49).

Lemma 15. For $x, x' \in \mathcal{X}$ and $\epsilon = |\gamma_{n,i}(x') - \gamma_{n,i}(x)|$ one has

$$\sup_{w \in \mathbb{R}} P(w \le D_1 \le w + \epsilon) \le \max\Big\{\frac{c_p}{c_h + c_p}, \frac{c_h + c}{c_h + c_p}\Big\} < 1.$$

Proof. Without any loss of generality, we again take $x \le x'$ and note that the inclusion (47) tells us that $x' \le s_\infty$. Next, the monotonicity of the restocking formula (44) and the defining relations in (45) give us that

$$s_1 \le \gamma_{n,i}(x) \le \gamma_{n,i}(x') \le s_\infty,$$

so if $\epsilon = \gamma_{n,i}(x') - \gamma_{n,i}(x)$ then one has that

$$0 \le \epsilon = \gamma_{n,i}(x') - \gamma_{n,i}(x) \le s_\infty - s_1.$$

Now, if $w + \epsilon \le s_\infty$, then we have the trivial bound

$$P(w \le D_1 \le w + \epsilon) \le P(D_1 \le s_\infty),$$

while if $w + \epsilon > s_\infty$ then $w > s_1$ and we similarly have

$$P(w \le D_1 \le w + \epsilon) \le P(D_1 \ge s_1).$$

By the definitions of $s_1$ and $s_\infty$, we see from (45) that

$$P(D_1 \le s_\infty) = \frac{c_p}{c_h + c_p} \quad\text{and}\quad P(D_1 \ge s_1) = \frac{c_h + c}{c_h + c_p},$$

where both probabilities are strictly smaller than one because $c < c_p$ and $c_h > 0$. □
Our Lemmas 14 and 15 tell us that for all $1 \le i \le n$ we have a uniform bound on the contraction coefficient,

$$\delta(K^{(n)}_{i,i+1}) = \sup_{x, x' \in \mathcal{X}} \| K^{(n)}_{i,i+1}(x, \cdot) - K^{(n)}_{i,i+1}(x', \cdot) \|_{TV} \le \max\Big\{\frac{c_p}{c_h + c_p}, \frac{c_h + c}{c_h + c_p}\Big\}.$$

This tells us that for the minimal ergodic coefficient we have

$$\alpha_n = \min_{1 \le i \le n} \{1 - \delta(K^{(n)}_{i,i+1})\} \ge \min\Big\{\frac{c_h}{c_h + c_p}, \frac{c_p - c}{c_h + c_p}\Big\} > 0,$$

and this bound completes the first step in the proof of Theorem 13.
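For a given set of cost parameters, the two bounds above are simple to evaluate. The following small sketch does so; the function names and the parameter check are assumptions of this example, and the numbers in the usage line are arbitrary.

```python
def contraction_bound(c, c_h, c_p):
    """Upper bound max{c_p, c_h + c} / (c_h + c_p) on the contraction coefficient."""
    if not (0 <= c < c_p and c_h > 0):
        raise ValueError("the model assumes 0 <= c < c_p and c_h > 0")
    return max(c_p, c_h + c) / (c_h + c_p)


def ergodic_lower_bound(c, c_h, c_p):
    """Lower bound min{c_h, c_p - c} / (c_h + c_p) on the minimal ergodic coefficient."""
    return 1.0 - contraction_bound(c, c_h, c_p)


# With c = 1, c_h = 2, c_p = 5 the bounds are 5/7 and 2/7 respectively.
delta_bound = contraction_bound(1.0, 2.0, 5.0)
alpha_bound = ergodic_lower_bound(1.0, 2.0, 5.0)
```

Neither bound depends on $n$ or on the demand density, which is what makes the lower bound uniform.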


Variance Lower Bound 

Here, as in most stochastic dynamic programs, the value to-go process can be expressed in terms of the value functions that solve the dynamic programming recursion (42). In particular, at time $1 \le i \le n$, when the current generalized inventory is $X_{n,i}$ and there are $n - i + 1$ demands yet to be realized, one has

$$V_{n,i} = v_{n-i+1}(X_{n,i}),$$

where the function $x \mapsto v_{n-i+1}(x)$ is calculated by (42). Moreover, since we start with $X_{n,1} = x \in \mathcal{X}$, the definition of $v_n(x)$ gives us

$$V_{n,1} = v_n(x) = E[C_n(\pi_n^*)],$$

and the martingale decomposition (16) can be written more simply as

$$C_n(\pi_n^*) - E[C_n(\pi_n^*)] = \sum_{i=1}^{n} d_{n,i+1}.$$

To bound $\mathrm{Var}[C_n(\pi_n^*)]$ from below, one then just needs to find an appropriate lower bound on $E[d_{n,i+1}^2]$ for $1 \le i \le n$.

For our inventory problem we begin by writing the martingale differences more explicitly as

$$d_{n,i+1} = c(\gamma_{n,i}(X_{n,i}) - X_{n,i}) + L(X_{n,i+1}) + v_{n-i}(X_{n,i+1}) - v_{n-i+1}(X_{n,i}). \tag{50}$$

Next, we introduce the shorthand $\bar{v}_{n-i}(x) = L(x) + v_{n-i}(x)$, and we obtain from the recursion (42) and the policy characterization (44) that

$$v_{n-i+1}(x) = c(\gamma_{n,i}(x) - x) + E[L(\gamma_{n,i}(x) - D_i)] + E[v_{n-i}(\gamma_{n,i}(x) - D_i)] = c(\gamma_{n,i}(x) - x) + E[\bar{v}_{n-i}(\gamma_{n,i}(x) - D_i)]. \tag{51}$$

We now replace $x$ with $X_{n,i}$ in (51) to get a new expression for $v_{n-i+1}(X_{n,i})$, and we replace the last summand of (50) with this expression. If we recall from (46) that $X_{n,i+1} = \gamma_{n,i}(X_{n,i}) - D_i$, then we find after simplification that

$$d_{n,i+1} = \bar{v}_{n-i}(\gamma_{n,i}(X_{n,i}) - D_i) - E[\bar{v}_{n-i}(\gamma_{n,i}(X_{n,i}) - D_i) \mid \mathcal{F}_{n,i}],$$

where, just as before, one has $\mathcal{F}_{n,i} = \sigma\{X_{n,1}, X_{n,2}, \ldots, X_{n,i}\}$. This representation gives us a key starting point for estimating the second moment of $d_{n,i+1}$.

Lemma 16. For the inventory cost $C_n(\pi_n^*)$ realized under the mean-optimal policy $\pi_n^*$, there is $\beta > 0$ such that, for all $n \ge 1$, one has the variance lower bound

$$\mathrm{Var}[C_n(\pi_n^*)] = \sum_{i=1}^{n} E[d_{n,i+1}^2] \ge \beta n.$$
2—1 
Proof. We now let $(D_1', D_2', \ldots, D_n')$ be an independent copy of $(D_1, D_2, \ldots, D_n)$. Since $X_{n,i}$ is $\mathcal{F}_{n,i}$-measurable, one then has the further representation

$$E[d_{n,i+1}^2 \mid \mathcal{F}_{n,i}] = \tfrac{1}{2} E\big[\{\bar{v}_{n-i}(\gamma_{n,i}(X_{n,i}) - D_i) - \bar{v}_{n-i}(\gamma_{n,i}(X_{n,i}) - D_i')\}^2 \mid \mathcal{F}_{n,i}\big].$$

Next, we consider the set $G(X_{n,i})$ of all $\omega$ such that

$$D_i(\omega) \in [\gamma_{n,i}(X_{n,i}) - s_1,\ \gamma_{n,i}(X_{n,i})] \quad\text{and}\quad D_i'(\omega) \in [\gamma_{n,i}(X_{n,i}) - s_1,\ \gamma_{n,i}(X_{n,i})].$$

In other words, at time $i$ when the generalized inventory begins with $X_{n,i}$, one has for $\omega \in G(X_{n,i})$ that either the demand $D_i(\omega)$ or the demand $D_i'(\omega)$ would cause one to order up to the level $s_{n-i}$ in period $i + 1$.

If we now replace $i$ with $i + 1$ in the recursion (51) we see that

$$\{\bar{v}_{n-i}(x) - \bar{v}_{n-i}(y)\}\,\mathbb{1}\big((x,y) \in [0, s_1]^2\big) = (c + c_h)(y - x)\,\mathbb{1}\big((x,y) \in [0, s_1]^2\big),$$

because the two new inventory levels for the next period $i + 1$ are both given by $\gamma_{n,i+1}(x) = \gamma_{n,i+1}(y) = s_{n-i}$ and because one incurs holding costs that are proportional to the difference $y - x$. This last equivalence gives us the lower bound

$$E[d_{n,i+1}^2 \mid X_{n,i}] \ge \tfrac{1}{2}(c + c_h)^2\, E\big[\{D_i' - D_i\}^2\, \mathbb{1}(G(X_{n,i})) \mid \mathcal{F}_{n,i}\big],$$

and the expectation on the right-hand side is given by

$$I = \int_{\gamma_{n,i}(X_{n,i}) - s_1}^{\gamma_{n,i}(X_{n,i})} \int_{\gamma_{n,i}(X_{n,i}) - s_1}^{\gamma_{n,i}(X_{n,i})} \{u - w\}^2\, \psi(u)\psi(w)\, du\, dw.$$

The integrand is non-negative so we can restrict the domain of integration from $G(X_{n,i})$ to

$$G'(X_{n,i}) = \big[\gamma_{n,i}(X_{n,i}) - s_1,\ \gamma_{n,i}(X_{n,i}) - \tfrac{2}{3}s_1\big] \times \big[\gamma_{n,i}(X_{n,i}) - \tfrac{1}{3}s_1,\ \gamma_{n,i}(X_{n,i})\big],$$

to obtain the relaxed lower bound

$$I \ge \int_{\gamma_{n,i}(X_{n,i}) - s_1/3}^{\gamma_{n,i}(X_{n,i})} \int_{\gamma_{n,i}(X_{n,i}) - s_1}^{\gamma_{n,i}(X_{n,i}) - 2s_1/3} \{u - w\}^2\, \psi(u)\psi(w)\, du\, dw.$$

One then has the trivial bound

$$\frac{s_1}{3} \le w - u \quad \text{for all } (u, w) \in G'(X_{n,i}),$$

so, in the end, we have 

s 2 9 1 

/>/3 =-p inf {^(w - -si) - ^(w - - -si)} > 0. 

9 j«6[si,soo] 3 3 

where the strict positivity of $\beta$ follows from the fact that $\Psi$ is continuous and strictly increasing on the compact set $[0, s_\infty] \subseteq [0, J]$. Thus, the infimum is attained and strictly positive, so in summary we have

$$E[d_{n,i+1}^2 \mid X_{n,i}] \ge \beta > 0 \quad \text{for all } 1 \le i \le n.$$

One then completes the proof of the lemma by taking total expectations and summing over $1 \le i \le n$. □
State Space Extension: Degeneracy of a Bivariate Chain

One can write the realized cost (48) as an additive functional of a Markov chain if one moves from the basic chain $\{X_{n,i} : 1 \le i \le n+1\}$ on $\mathcal{X}$ to the Markov chain

$$\{\widehat{X}_{n,i} = (X_{n,i}, X_{n,i+1}) : 1 \le i \le n\} \tag{52}$$

on the enlarged state space $\mathcal{X}^2 = \mathcal{X} \times \mathcal{X}$. The realized cost (48) then becomes

$$C_n(\pi_n^*) = \sum_{i=1}^{n} f_{n,i}(\widehat{X}_{n,i}), \tag{53}$$

and one might hope to apply Dobrushin's CLT to get the asymptotic distribution of $C_n(\pi_n^*)$. To see why this plan does not succeed, one just needs to calculate the minimal ergodic coefficient for the extended chain (52).

For any $x, y \in \mathcal{X}$ and any $B \times B' \in \mathcal{B}(\mathcal{X}^2)$, the transition kernel of the Markov chain (52) is given by

$$K^{(n)}_{i,i+1}((x,y), B \times B') = P(X_{n,i+1} \in B,\ X_{n,i+2} \in B' \mid X_{n,i} = x,\ X_{n,i+1} = y) = \mathbb{1}(y \in B)\, P(\{\gamma_{n,i+1}(y) - D_{i+1}\} \in B' \mid X_{n,i+1} = y),$$

where $\gamma_{n,i}(x)$ is the function defined in (44). If we now set $B' = \mathcal{X}$, we have

$$K^{(n)}_{i,i+1}((x,y), B \times \mathcal{X}) = \begin{cases} 1 & \text{if } y \in B, \\ 0 & \text{if } y \in B^c, \end{cases}$$

so for $y \in B$ and $y' \in B^c$ we have

$$K^{(n)}_{i,i+1}((x,y), B \times \mathcal{X}) - K^{(n)}_{i,i+1}((x,y'), B \times \mathcal{X}) = 1.$$

This tells us that the minimal ergodic coefficient of the chain (52) is given by

$$\alpha_n = 1 - \max_{1 \le i < n} \Big\{ \sup_{(x,y),(x',y')} \| K^{(n)}_{i,i+1}((x,y), \cdot) - K^{(n)}_{i,i+1}((x',y'), \cdot) \|_{TV} \Big\} = 0,$$

and, as a consequence, we see that Dobrushin's classic CLT simply does not apply to the sum (53).
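The same degeneracy already shows up for a finite toy chain. In the sketch below, the two-state kernel, the dictionary representation, and the helper names are assumptions made for illustration; the point is only that pairing consecutive states forces the contraction coefficient to one even when the original chain mixes well.

```python
from itertools import product


def dobrushin_coefficient(kernel):
    """Largest total variation distance between two rows of a finite kernel.

    kernel maps each state to a dict {next_state: probability}.
    """
    targets = {t for row in kernel.values() for t in row}
    worst = 0.0
    for a, b in product(list(kernel), repeat=2):
        tv = 0.5 * sum(abs(kernel[a].get(t, 0.0) - kernel[b].get(t, 0.0))
                       for t in targets)
        worst = max(worst, tv)
    return worst


def pair_chain(kernel):
    """Kernel of the bivariate chain (x, y) -> (y, z), where z is drawn from kernel[y]."""
    return {(x, y): {(y, z): p for z, p in kernel[y].items()}
            for x in kernel for y in kernel}


# A toy two-state kernel (assumed for illustration only).
K = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.3, 1: 0.7}}
delta_single = dobrushin_coefficient(K)              # about 0.3
delta_pair = dobrushin_coefficient(pair_chain(K))    # about 1.0
```

Two pairs $(x, y)$ and $(x, y')$ with $y \ne y'$ move to states with different first coordinates, so their transition laws have disjoint supports and the ergodic coefficient $1 - \delta$ of the paired chain is zero.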

Finally, as one ponders alternative proofs, there is a further possibility that one might consider. In Section 2 we noted the possibility of replacing the minimal ergodic coefficient $\alpha_n$ of the Markov chain (52) with a potentially less fragile measure of dependence such as the maximal coefficient of correlation $\rho_n$ used by Peligrad (2012). For the bivariate chain (52), the maximal coefficient of correlation is given by

$$\rho_n = \max_{2 \le i \le n}\ \sup_{g} \Big\{ \frac{\| E[g(\widehat{X}_{n,i}) \mid \widehat{X}_{n,i-1}] \|_2}{\| g(\widehat{X}_{n,i}) \|_2} : \| g(\widehat{X}_{n,i}) \|_2 < \infty \text{ and } E[g(\widehat{X}_{n,i})] = 0 \Big\},$$

so for the functional

$$g(\widehat{X}_{n,i}) = g(X_{n,i}, X_{n,i+1}) = X_{n,i} - E[X_{n,i}],$$

one has $\rho_n = 1$, and we see that the CLT of Peligrad (2012) does not help us here.
Accommodation of Lead Times for Deliveries

To keep the description of the inventory problem as brief as possible, we have 
assumed that order fulfillment is instantaneous. Nevertheless, in a more realistic 
model, one might want to accommodate the possibility of lead times for delivery 
fulfillments. 

One practical benefit of our "look-ahead" parameter $m$ is that one can allow for lead times and still stay within the scope of Theorem 1. We do not need to pursue this particular extension here, but it does help to illustrate another way the look-ahead parameter can be used.


9. An Application in Combinatorial Optimization: 

Online Alternating Subsequences 

Given a sequence $y_1, y_2, \ldots, y_n$ of $n$ distinct real numbers, we say that a subsequence $y_{i_1}, y_{i_2}, \ldots, y_{i_k}$, $1 \le i_1 < i_2 < \cdots < i_k \le n$, is alternating provided that the relative magnitudes alternate as in


Combinatorial investigations of alternating subsequences go back to Euler (cf. Stanley, 2010), but probabilistic investigations are more recent: Widom (2006), Pemantle (cf. Stanley, 2007, p. 568), Stanley (2008) and Houdré and Restrepo (2010) all considered the distribution of the length of the longest alternating subsequence of a random permutation or of a sequence $\{Y_1, Y_2, \ldots, Y_n\}$ of independent random variables with the uniform distribution on $[0, 1]$. There have also been recent applications of this work in computer science (e.g. Romik, 2011; Bannister and Eppstein, 2012) and in tests of independence (cf. Brockwell and Davis, 2006, p. 312).

Here we consider alternating subsequences in a sequential, or online, context where we are presented with the values $Y_1, Y_2, \ldots, Y_n$ one at a time, and the goal is to select an alternating subsequence

$$Y_{\tau_1} < Y_{\tau_2} > Y_{\tau_3} < \cdots \lessgtr Y_{\tau_k} \tag{54}$$

that has maximal expected length.

A sequence of selection times $1 \le \tau_1 < \tau_2 < \cdots < \tau_k \le n$ that satisfy (54) is called a feasible policy if our decision to accept or reject $Y_i$ as a member of the alternating subsequence is based only on our knowledge of the observations $\{Y_1, Y_2, \ldots, Y_i\}$. In more formal terms, the feasibility of a policy is equivalent to requiring that the indices $\tau_k$, $k = 1, 2, \ldots$, are all stopping times with respect to the increasing sequence of $\sigma$-fields $\mathcal{A}_i = \sigma\{Y_1, Y_2, \ldots, Y_i\}$, $1 \le i \le n$.

We now let $\Pi$ denote the set of all feasible policies, and for $\pi \in \Pi$, we let $A_n^o(\pi)$ be the number of alternating selections made by $\pi$ for the realization $\{Y_1, Y_2, \ldots, Y_n\}$, so

$$A_n^o(\pi) = \max\{k : Y_{\tau_1} < Y_{\tau_2} > \cdots \lessgtr Y_{\tau_k} \text{ and } 1 \le \tau_1 < \tau_2 < \cdots < \tau_k \le n\}.$$

We say that a policy $\pi_n^* \in \Pi$ is optimal (or, more precisely, mean-optimal) if

$$E[A_n^o(\pi_n^*)] = \sup_{\pi \in \Pi} E[A_n^o(\pi)].$$

Arlotto et al. (2011) found that for each $n$ there is a unique mean-optimal policy $\pi_n^*$ such that

$$E[A_n^o(\pi_n^*)] = (2 - \sqrt{2})\,n + O(1),$$
and it was later found that there is a CLT for $A_n^o(\pi_n^*)$.

Theorem 17 (CLT for Optimal Number of Alternating Selections). For the mean-optimal number of alternating selections $A_n^o(\pi_n^*)$ one has

$$\frac{A_n^o(\pi_n^*) - E[A_n^o(\pi_n^*)]}{\sqrt{\mathrm{Var}[A_n^o(\pi_n^*)]}} \Longrightarrow N(0,1) \quad \text{as } n \to \infty.$$

The main goal of this section is to show that Theorem 1 leads to a proof of this theorem that is quicker, more robust, and more principled than the original proof given in Arlotto and Steele (2014). In the process, we also get a second illustration of the ways in which Theorem 1 helps one sidestep the degeneracy that sometimes arises when one tries to use Dobrushin's theorem on a naturally associated bivariate chain. In fact, it is this feature of Dobrushin's theorem that initially motivated the development of Theorem 1.


Structure of the Additive Process 

To formulate the alternating subsequence problem as an MDP, we first consider a new state space that consists of pairs $(x, s)$ where $x$ denotes the value of the last selected observation and where we set $s = 0$ if $x$ is a local minimum and set $s = 1$ if $x$ is a local maximum. The decision problem then has a notable reflection property: the optimal expected number of alternating selections that one makes when $k$ observations are yet to be seen is the same if the system is in state $(x, 0)$ or if the system is in state $(1 - x, 1)$. Earlier analyses exploited this symmetry to show that there is a sequence $\{g_k : 1 \le k < \infty\}$ of optimal threshold functions such that if one sets $X_{n,1} = 0$ and lets

$$X_{n,i+1} = \begin{cases} X_{n,i} & \text{if } Y_i < g_{n-i+1}(X_{n,i}) \\ 1 - Y_i & \text{if } Y_i \ge g_{n-i+1}(X_{n,i}), \end{cases} \tag{55}$$

then the optimal number of alternating selections has the representation

$$A_n^o(\pi_n^*) = \sum_{i=1}^{n} \mathbb{1}\big(Y_i \ge g_{n-i+1}(X_{n,i})\big) = \sum_{i=1}^{n} \mathbb{1}(X_{n,i+1} \ne X_{n,i}).$$

The derivation of these relations requires a substantial amount of work, but for the purpose of illustrating Theorem 1 and Corollary 2 one does not need to go into the details of the construction of these optimal threshold functions. Here it is enough to note that this representation for $A_n^o(\pi_n^*)$ is exactly of the form (1) that is addressed by Theorem 1.
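To see the mechanics of the recursion (55), the sketch below simulates the chain with a single stationary threshold in place of the optimal functions $g_k$. The threshold $\max\{x, 1/6\}$ is a hypothetical stand-in chosen only because it satisfies $g(x) = x$ on $[1/3, 1]$ and $g(x) \ge 1/6$; it is not claimed to be the optimal policy, and the function names are likewise assumptions.

```python
import random


def threshold(x):
    """Hypothetical stand-in for the optimal thresholds g_k (an assumption)."""
    return max(x, 1.0 / 6.0)


def simulate_selections(ys, g=threshold):
    """Run the recursion (55) from X_{n,1} = 0 and count the selections."""
    x = 0.0
    states, selections = [x], 0
    for y in ys:
        if y >= g(x):                        # accept y: the reflected state is 1 - y
            x = 1.0 - y
            selections += 1
        states.append(x)                     # on rejection the state is unchanged
    return selections, states


rng = random.Random(0)                       # fixed seed for reproducibility
ys = [rng.random() for _ in range(1000)]
count, states = simulate_selections(ys)
```

The selection count equals the number of indices at which the state changes, which mirrors the two equivalent sums in the representation above, and every state stays in $[0, 5/6]$ because accepted values are at least $1/6$.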

The proof of Theorem 17 then takes two steps. First, one needs an appropriate lower bound for the minimal ergodic coefficients of the chain (55), and second one needs to check that the variance of $A_n^o(\pi_n^*)$ goes to infinity as $n \to \infty$.

The second property is almost baked into the cake, and it is even proved in Arlotto and Steele (2014) that $\mathrm{Var}[A_n^o(\pi_n^*)]$ grows linearly with $n$. Still, to keep our discussion brief, we will not repeat that proof. Instead we focus on the new, and more strategic, fact that minimal ergodic coefficients of the Markov chains (55) are uniformly bounded away from zero for all $1 \le i \le n - 2$ and all $n \ge 3$.
A Lower Bound for the Minimal Ergodic Coefficient

For any $x \in [0,1]$ and any Borel set $B \subseteq [0,1]$, the Markov chain (55) has the transition kernel

$$K^{(n)}_{i,i+1}(x, B) = \mathbb{1}(x \in B)\, g_{n-i+1}(x) + \int_{g_{n-i+1}(x)}^{1} \mathbb{1}(1 - u \in B)\, du = \mathbb{1}(x \in B)\, g_{n-i+1}(x) + \big| B \cap [0, 1 - g_{n-i+1}(x)] \big|,$$

where the first summand of the top equation accounts for the rejection of the newly presented value $Y_i = u$, and the second summand accounts for its acceptance.

To obtain a meaningful estimate for the contraction coefficient of $K^{(n)}_{i,i+1}$ we recall from the earlier analyses that the optimal threshold functions $\{g_k : 1 \le k < \infty\}$ have the two basic properties: (i) $g_k(x) = x$ for all $x \in [1/3, 1]$ and all $k \ge 1$, and (ii) $g_k(x) \ge 1/6$ for all $x \in [0,1]$ and all $k \ge 3$. Property (ii) and the recursion (55) give us $X_{n,i} \le 5/6$ for all $1 \le i \le n - 2$, and we see from property (i) that

$$\delta(K^{(n)}_{i,i+1}) = \sup_{x,x'} \| K^{(n)}_{i,i+1}(x, \cdot) - K^{(n)}_{i,i+1}(x', \cdot) \|_{TV} \le \frac{5}{6} \quad \text{for all } 1 \le i \le n - 2.$$

This estimate gives us in turn that

$$\alpha_{n-2} = \min_{1 \le i \le n-2} \{1 - \delta(K^{(n)}_{i,i+1})\} \ge \frac{1}{6},$$

so by Corollary 2 we have the CLT for $A_{n-2}^o(\pi_n^*)$. Since $A_n^o(\pi_n^*)$ and $A_{n-2}^o(\pi_n^*)$ differ by at most 2, this also completes the proof of Theorem 17.

10. A Final Observation 


Theorem 1 generalizes the classical CLT of Dobrushin (1956), and it offers a pre-packaged approach to the CLT for the kinds of additive functionals that one meets in the theory of finite horizon Markov decision processes. The technology of MDPs is wedded to the pursuit of policies that maximize total expected rewards, but such policies may not make good economic sense unless the realized reward is "well behaved." While there are several ways to characterize good behavior, asymptotic normality of the realized reward is likely to be high on almost anyone's list. The orientation of Theorem 1 addresses this issue in a direct and practical way.

The examples of Sections 8 and 9 illustrate more concretely what one needs to do to apply Theorem 1. In a nutshell, one needs to show that the variance of the total reward goes to infinity and one needs an a priori lower bound on the minimal coefficient of ergodicity. These conditions are not trivial, but, as the examples show, they are not intractable. Now, whenever one faces the question of a CLT for the total reward of a finite horizon MDP, there is an explicit agenda that lays out what one needs to do.